BACKGROUND


The  objective  of  this  study  is  to  evaluate  the  retrieval  effectiveness 
of  the  Technical  Reports  (TR)  database  of  the  Defense  RDT&E  On-Line  System 
(DROLS)  when  a  free  text  generated  index  file  is  used  instead  of  indexer 
assigned  uncontrolled  vocabulary. 

The documents in the Technical Reports database are assigned various
posting terms from the thesaurus or controlled vocabulary (DTIC Retrieval
and Indexing Terminology-DRIT). In addition, indexers have the option of
assigning terms not found in the controlled vocabulary, which are known as
identifiers or open-ended terms.


These terms have historically been assigned along with the controlled
vocabulary, to pick up topics where a main idea or concept of a report is
not covered in the thesaurus. An identifier was assigned to describe a
very specific item, usually an alphanumeric, which would represent a
project, code name, equipment model number, etc. Examples of identifiers
are: F 104 Fighter, AN/SPS-39, and Plurabob Project. Open-ended terms have
been assigned to describe new technology or concepts, acronyms, author
suggested terms, etc. Previously a distinction was made in the database as
to whether a term was an identifier or an open-ended term, but currently


The free text file in the TR database contains single words taken from
the titles and abstracts, which are directly searchable. The free text
inverted file is built as follows:

(1)  Alphabetic,  alphanumeric  or  numeric  strings  of  characters  up  to  60 
characters  in  length. 

(2)  All  special  characters  (commas,  periods,  slash  marks,  colons, 
etc.)  are  converted  to  blanks  which  serve  as  term  delimiters. 

(3)  A  term  which  is  present  on  the  stop  word  list  is  discarded  (see 
Attachment  A). 

As an example, the following Technical Report abstract will provide the
listed free text terms.


A SELECTIVE DETECTION SCHEME FOR ATOMS IN THE METASTABLE 2S STATE
OF HYDROGEN THAT PROVIDES THE HIGH SPATIAL RESOLUTION (0.1 CM)
NECESSARY FOR TIME-OF-FLIGHT ATOMIC BEAM STUDIES IS DESCRIBED.
THE SCHEME UTILIZES THE LYMAN PHOTON EMITTED WHEN THE METASTABLE
IS DE-EXCITED IN AN ELECTRIC FIELD VIA THE STARK EFFECT. DETAILS
OF CONSTRUCTION AND OPERATION ARE DISCUSSED.
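The three rules for building the free text inverted file can be sketched in Python and applied to the first sentence of the abstract above. This is only an illustration, not the DROLS implementation: the function name free_text_terms is made up, the stop words shown are hypothetical stand-ins for the list in Attachment A, and cutting over-long strings to 60 characters is an assumed reading of rule (1).

```python
import re

# Hypothetical stop words for illustration only; the real list is in Attachment A.
STOP_WORDS = {"A", "AN", "AND", "ARE", "FOR", "IN", "IS", "OF",
              "THAT", "THE", "VIA", "WHEN"}
MAX_TERM_LENGTH = 60  # rule (1): strings of up to 60 characters


def free_text_terms(text, stop_words=STOP_WORDS):
    """Return the distinct free text terms of a title or abstract, in order."""
    # Rule (2): every special character becomes a blank; blanks delimit terms.
    blanked = re.sub(r"[^A-Za-z0-9]", " ", text.upper())
    terms = []
    for token in blanked.split():
        # Rule (1): assumed here to mean that longer strings are cut off.
        token = token[:MAX_TERM_LENGTH]
        # Rule (3): a term on the stop word list is discarded.
        if token in stop_words:
            continue
        if token not in terms:
            terms.append(token)
    return terms


sample = ("A SELECTIVE DETECTION SCHEME FOR ATOMS IN THE METASTABLE 2S STATE "
          "OF HYDROGEN THAT PROVIDES THE HIGH SPATIAL RESOLUTION (0.1 CM) "
          "NECESSARY FOR TIME-OF-FLIGHT ATOMIC BEAM STUDIES IS DESCRIBED.")
print(free_text_terms(sample))
```

Because of rule (2), "0.1" is indexed as the two terms 0 and 1, and TIME-OF-FLIGHT contributes TIME and FLIGHT as separate words (OF is dropped only because it is on the hypothetical stop list used here).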

This report will assess the retrieval effectiveness of specific terms
searched in both the free text system and as identifiers/open-ended terms
using the records for 150,000 entries in the Technical Reports database.


METHODOLOGY 


A collection of terms was assembled that was felt to be representative
of typical terms, not found in the DTIC Retrieval and Indexing Terminology,
that a DROLS user might come up with during a search. A total of 212 terms
were chosen to be searched. Of these, 100 to 125 were chosen from the
"Combined Frequency Count", which is a multivolume, alphabetical listing of
DRIT terms and identifiers, along with their frequency of occurrence in the
DROLS databases. These specific words were used in order to assure a number
of search terms with known hits as identifiers in the Technical Reports
file. In contrast to this, approximately 75 to 100 words or word phrases
were chosen without reference to the "Combined Frequency Count". Most of
them relate in some way to the subject content of the TR database. Also
included is a sampling of subject areas not normally connected to the
Department of Defense, but which may be representative of certain needs of
DROLS users, and on which research may have been performed by DoD.

The test was done on each individual term (a term may be one or more
words, or alphanumerics, not found in the DRIT), not on specific searches,
strategies, or combinations of terms.

After the approximately 200 search terms were chosen, they were
individually searched in the Technical Reports database using the terms as
indexer assigned keywords. Since the free text file was only loaded for a
certain set of AD (Accessioned Document) number ranges (AD900000-AD924000,
ADA000001-ADA075000, ADB000001-ADB045000), the searches were limited to
those ranges only. It was decided that terms having up to 15 hits would be
included in the relevancy check. Occurrences greater than 15 were included
in the overall totals, but not in the relevancy count. Bibliographies were
ordered for the search terms with up to 15 hits. All bibliographies were
then checked for relevancy to the term searched. If there were any
questions as to the relevancy of a specific item, a copy of the document
itself was reviewed and rated. All items in each bibliography were
designated as relevant, marginally relevant, or not relevant.

Searches of the same terms were done on the free text file. Terms
containing more than one word were searched using the Boolean operator
"AND" (in the keyword system they had been searched on one level as a
single multiword index term). A term such as AGENT ORANGE would be
searched in the following manner:

As an identifier-

@STR@
AGENT ORANGE
END

In the free text file-

@STR@
AGENT
AND
ORANGE
END
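The difference between the two search forms can be sketched in Python. This is a minimal illustration under stated assumptions, not the DROLS retrieval logic: the three records, their AD numbers, and the function names are all invented for the example. An identifier is looked up as one multiword index term, while the free text search intersects the posting sets of the individual words.

```python
from collections import defaultdict

# Hypothetical records: AD number -> (indexer assigned identifiers, text).
RECORDS = {
    "AD-1": (["AGENT ORANGE"], "DISPOSAL OF AGENT ORANGE STOCKS"),
    "AD-2": ([], "ORANGE DYE USED AS A TRACING AGENT"),
    "AD-3": ([], "HERBICIDE AGENT STUDIES"),
}


def build_indexes(records):
    """Build an identifier index and a single-word free text index."""
    identifier_index = defaultdict(set)
    free_text_index = defaultdict(set)
    for ad_number, (identifiers, text) in records.items():
        for identifier in identifiers:
            identifier_index[identifier].add(ad_number)
        for word in text.split():
            free_text_index[word].add(ad_number)
    return identifier_index, free_text_index


def search_identifier(index, term):
    """Search the term on one level, as a single multiword index term."""
    return set(index.get(term, set()))


def search_free_text(index, term):
    """Search each word separately and combine the results with AND."""
    postings = [index.get(word, set()) for word in term.split()]
    return set.intersection(*postings) if postings else set()


identifier_index, free_text_index = build_indexes(RECORDS)
print(sorted(search_identifier(identifier_index, "AGENT ORANGE")))
print(sorted(search_free_text(free_text_index, "AGENT ORANGE")))
```

In this made-up collection the AND search also returns the record in which AGENT and ORANGE occur far apart, which is the kind of extra hit a free text search would be expected to add.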

In the free text test, if a search term resulted in more than 15 hits, a
qualifying search was done, performing a text scan on the hits.
This utilized the ability to string search (search based on the physical
relationship of the words in the term). Text scan was done only if the
term had more than one word or alphanumeric grouping in it. Single words
were not qualified by string searching. Bibliographies were then ordered
for those terms having 15 hits or fewer.
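The qualifying step can be sketched as follows. The sketch assumes that the "physical relationship of the words" means the words of the term appearing next to each other in the same order; the report does not spell this out, and the function names and sample hits are invented for the illustration.

```python
def contains_phrase(text, term):
    """True if the words of term occur adjacently and in order in text."""
    words = text.split()
    phrase = term.split()
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))


def qualify_hits(hits, term, threshold=15):
    """Apply the text scan only to multiword terms with more than threshold hits.

    hits maps an AD number to the title/abstract text of that record.
    """
    if len(hits) <= threshold or len(term.split()) < 2:
        return dict(hits)  # single words and small result sets are left alone
    return {ad: text for ad, text in hits.items()
            if contains_phrase(text, term)}


# Hypothetical hits from an AND search on AGENT and ORANGE.
hits = {
    "AD-1": "DISPOSAL OF AGENT ORANGE STOCKS",
    "AD-2": "ORANGE DYE USED AS A TRACING AGENT",
}
# The threshold is lowered to 1 only so that this two-record example is scanned.
print(qualify_hits(hits, "AGENT ORANGE", threshold=1))
```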

Again, all the bibliographic references for each term were checked for
relevancy to the term searched, and each item was designated as relevant,
marginally relevant, or not relevant. In any instances where the relevancy
was in doubt, a copy of the document itself was examined.
Relevancy statistics are presented only for terms having 15 or fewer hits in
both the identifier and free text systems.


RESULTS 


The 212 terms searched produced a total of 334 hits as identifiers,
and 5998 hits in the free text system (this was reduced to 3930 hits after
string searching of multiple word terms). Of these, 52 terms (24.53%) had
no hits in either system and 38 terms (17.92%) had greater than 15 hits in
both systems and were therefore not checked for relevancy.

Of the 212 terms, slightly greater than 50% (122 terms) provided a
number of hits (0-15) which were then checked for relevancy. Twelve terms
produced hits as identifiers, with no hits in the free text system.
Forty-eight terms had hits in the free text, with none as identifiers.
Sixty-two terms resulted in hits in both systems. In totaling these up for
the relevancy count, the 122 terms searched as identifiers resulted in 187
hits, and in the free text system 596 hits. The 187 hits from the
identifier searches consist of 103 that were determined to be relevant
(55.08%), 73 that were marginally relevant (39.04%), and 11 hits not
relevant (5.88%). In the 596 free text hits, 313 were found to be relevant
(52.53%), 217 marginally relevant (36.41%), and 66 hits not relevant
(11.07%).
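The percentages above can be recomputed from the stated counts with a few lines of Python. This is only an arithmetic check; the variable and function names are arbitrary. One figure differs in the last digit: 313 of 596 rounds to 52.52%, a hundredth below the printed 52.53%.

```python
def percentages(counts):
    """Return each count as a percentage of the total, to two decimals."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# Counts taken from the text above.
term_groups = {"identifier only": 12, "free text only": 48, "both systems": 62}
identifier_hits = {"relevant": 103, "marginally relevant": 73, "not relevant": 11}
free_text_hits = {"relevant": 313, "marginally relevant": 217, "not relevant": 66}

checked_terms = sum(term_groups.values())  # terms checked for relevancy
all_terms = checked_terms + 52 + 38        # plus no-hit and over-15 terms
hit_ratio = sum(free_text_hits.values()) / sum(identifier_hits.values())

print(checked_terms, all_terms)
print(percentages(identifier_hits))
print(percentages(free_text_hits))
print(round(hit_ratio, 2))
```

The ratio of free text hits to identifier hits in the relevancy sample works out to roughly 3.2 (596 to 187).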


DISCUSSION 


The conclusion that one can draw from the results of the test is that
the use of free text searching produces approximately three times as
many hits as using identifiers/open-ended terminology, with only a slight
decrease (about 2.5 percentage points) in the relevancy of the items
retrieved. There are instances of items not found in the controlled
vocabulary where the use of free text is beneficial, such as variant
spellings and word forms, alphanumerics, chemical terminology, foreign
names, proper names, etc. The free text searching technique becomes an
additional means for the search analyst to augment the search performance
of the system. It allows the searcher to get at specifics that the
controlled vocabulary does not directly address. Naturally, there are terms
that one would not normally use in a free text system, where use of the
controlled vocabulary and a defined search strategy would be necessary to
narrow down the results. In this study, some of the searches that were not
checked for relevancy, because the results numbered in the hundreds, would
have to be further defined using search alternatives, and perhaps some of
them would not necessarily be searched using free text. Free text searching
provides a viable alternative to the use of uncontrolled vocabulary and in
any retrieval system can prove to be a valuable tool.

ATTACHMENT B-TOTALS, STATISTICS AND RELEVANCY